Overview

Dataset Statistics

Number of Variables 6
Number of Rows 404290
Missing Cells 3
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 103.9 MB
Average Row Size in Memory 269.5 B
Variable Types
  • Numerical: 3
  • Categorical: 3
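
As a quick consistency check on the figures above, the average row size should equal the total size divided by the row count. The sketch below assumes that "MB" in this report means 1024 * 1024 bytes; the report does not state this, but the assumption reproduces the 269.5 B figure. The variable names are illustrative.

```python
# Values copied from the Dataset Statistics block above.
total_size_mb = 103.9
n_rows = 404290

# Assumption (not stated in the report): "MB" means 1024 * 1024 bytes.
BYTES_PER_MB = 1024 * 1024

avg_row_bytes = total_size_mb * BYTES_PER_MB / n_rows
print(f"{avg_row_bytes:.1f} B")
```

With decimal megabytes (10**6 bytes) the same division gives roughly 257 B, which does not match the reported value.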

Dataset Insights

id is uniformly distributed (Uniform)
qid1 and qid2 have similar distributions (Similar Distribution)
qid2 is skewed (Skewed)
question1 has a high cardinality: 290456 distinct values (High Cardinality)
question2 has a high cardinality: 299174 distinct values (High Cardinality)
is_duplicate has constant length 1 (Constant Length)

Variables


id

numerical

Approximate Distinct Count 404290
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6468640
Mean 202144.5
Minimum 0
Maximum 404289
Zeros 1
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • id is uniformly distributed

Quantile Statistics

Minimum 0
5-th Percentile 20214.45
Q1 101072.25
Median 202144.5
Q3 303216.75
95-th Percentile 384074.55
Maximum 404289
Range 404289
IQR 202144.5

Descriptive Statistics

Mean 202144.5
Standard Deviation 116708.6145
Variance 1.3621e+10
Sum 8.1725e+10
Skewness 4.7162e-16
Kurtosis -1.2
Coefficient of Variation 0.5774
  • id is not normally distributed (p-value 0.00045385139444316163)
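
The id statistics above are consistent with id being a plain row index 0, 1, ..., n - 1, where n is the row count. As a minimal sketch under that assumption, the closed-form expressions below reproduce the reported mean, sum, standard deviation, coefficient of variation, quantiles and kurtosis without loading any data. The standard deviation matches only with the sample (ddof=1) form, and the quantiles assume linear interpolation; both are inferences from the numbers, not documented behaviour of the profiling tool.

```python
import math

# Number of Rows from the overview; id is assumed to be 0, 1, ..., n - 1.
n = 404290

mean = (n - 1) / 2
total = n * (n - 1) // 2
sample_variance = n * (n + 1) / 12   # variance of 0..n-1 with ddof=1
sample_std = math.sqrt(sample_variance)
cv = sample_std / mean               # coefficient of variation
# Excess kurtosis of a discrete uniform distribution on n points.
excess_kurtosis = -6 * (n**2 + 1) / (5 * (n**2 - 1))


def quantile(p):
    """Linear-interpolation quantile of the sequence 0..n-1."""
    return p * (n - 1)


print(mean, total, round(sample_std, 4), round(cv, 4))
print(quantile(0.05), quantile(0.25), quantile(0.5), quantile(0.75), quantile(0.95))
print(round(excess_kurtosis, 4))
```

A skewness of order 1e-16 is floating-point noise around the exact value 0 expected for a symmetric sequence.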

qid1

numerical

Approximate Distinct Count 290654
Approximate Unique (%) 71.9%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6468640
Mean 217243.9424
Minimum 1
Maximum 537932
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • qid1 is skewed right (γ1 = 0.3797)

Quantile Statistics

Minimum 1
5-th Percentile 11290.9
Q1 74437.5
Median 192182
Q3 346573.5
95-th Percentile 496522.55
Maximum 537932
Range 537931
IQR 272136

Descriptive Statistics

Mean 217243.9424
Standard Deviation 157751.7
Variance 2.4886e+10
Sum 8.783e+10
Skewness 0.3797
Kurtosis -1.0906
Coefficient of Variation 0.7262
  • qid1 is not normally distributed (p-value 2.7459788513380145e-06)

qid2

numerical

Approximate Distinct Count 299364
Approximate Unique (%) 74.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 6468640
Mean 220955.6553
Minimum 2
Maximum 537933
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • qid2 is skewed right (γ1 = 0.3454)

Quantile Statistics

Minimum 2
5-th Percentile 11264
Q1 74727
Median 197052
Q3 354692.5
95-th Percentile 499507.2
Maximum 537933
Range 537931
IQR 279965.5

Descriptive Statistics

Mean 220955.6553
Standard Deviation 159903.1826
Variance 2.5569e+10
Sum 8.933e+10
Skewness 0.3454
Kurtosis -1.1426
Coefficient of Variation 0.7237
  • qid2 is not normally distributed (p-value 1.5241232640776062e-07)
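
Several of the qid1 and qid2 figures are derived from others: Range is Maximum minus Minimum, IQR is Q3 minus Q1, Coefficient of Variation is Standard Deviation divided by Mean, and Variance is the square of the Standard Deviation. The sketch below recomputes them from the reported values only; the dictionary layout and names are illustrative.

```python
# Reported values copied from the qid1 and qid2 sections above.
reported = {
    "qid1": {"min": 1, "max": 537932, "q1": 74437.5, "q3": 346573.5,
             "mean": 217243.9424, "std": 157751.7},
    "qid2": {"min": 2, "max": 537933, "q1": 74727, "q3": 354692.5,
             "mean": 220955.6553, "std": 159903.1826},
}

derived = {}
for name, s in reported.items():
    derived[name] = {
        "range": s["max"] - s["min"],
        "iqr": s["q3"] - s["q1"],
        "cv": s["std"] / s["mean"],
        "variance": s["std"] ** 2,
    }
    print(name, derived[name])
```

Small differences in the last digit are expected, because the inputs are themselves rounded in the report.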

question1

categorical

Approximate Distinct Count 290456
Approximate Unique (%) 71.8%
Missing 1
Missing (%) 0.0%
Memory Size 51178824

Length

Mean 59.5369
Standard Deviation 29.9405
Median 52
Minimum 1
Maximum 623

Sample

1st row What is the step b...
2nd row What is the story ...
3rd row How can I increase...
4th row Why am I mentally ...
5th row Which one dissolve...

Letter

Count 19224859
Lowercase Letter 18127812
Space Separator 4020499
Uppercase Letter 1097047
Dash Punctuation 17915
Decimal Number 164204
  • question1 contains many words: 78821 words
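
The labels in the Letter block match the names of Unicode general categories (Ll, Lu, Zs, Pd, Nd), and the reported Count equals Lowercase Letter plus Uppercase Letter (18127812 + 1097047 = 19224859). As a minimal sketch, assuming the report classifies characters this way, similar counts can be produced with Python's standard unicodedata module. The function name and the toy input are hypothetical, and this is not the profiling tool's own code.

```python
import unicodedata
from collections import Counter

# Unicode general categories mapped to the labels used in the report.
LABELS = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Zs": "Space Separator",
    "Pd": "Dash Punctuation",
    "Nd": "Decimal Number",
}


def char_class_counts(texts):
    """Count characters per Unicode category across an iterable of strings."""
    counts = Counter()
    for text in texts:
        for ch in text:
            label = LABELS.get(unicodedata.category(ch))
            if label is not None:
                counts[label] += 1
    # Assumption: "Count" is the number of cased letters, as the totals suggest.
    counts["Count"] = counts["Lowercase Letter"] + counts["Uppercase Letter"]
    return counts


# Toy input, not taken from the dataset.
sample_counts = char_class_counts(["What is 2-3?", "Hi there"])
print(dict(sample_counts))
```

Characters in other categories, such as the question mark, are ignored in this sketch.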

question2

categorical

Approximate Distinct Count 299174
Approximate Unique (%) 74.0%
Missing 2
Missing (%) 0.0%
Memory Size 51316133

Length

Mean 60.1087
Standard Deviation 33.8637
Median 51
Minimum 1
Maximum 1169

Sample

1st row What is the step b...
2nd row What would happen ...
3rd row How can Internet s...
4th row Find the remainder...
5th row Which fish would s...

Letter

Count 19344492
Lowercase Letter 18227102
Space Separator 4117742
Uppercase Letter 1117390
Dash Punctuation 18365
Decimal Number 169561
  • question2 contains many words: 74145 words

is_duplicate

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 26683140
  • The largest value (0) is over 1.71 times larger than the second largest value (1)

Length

Mean 1
Standard Deviation 0
Median 1
Minimum 1
Maximum 1

Sample

1st row 0
2nd row 0
3rd row 0
4th row 0
5th row 0

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 404290
  • The top 2 categories (0, 1) account for over 50.0% of values
  • is_duplicate has words of constant length
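
The imbalance insight for is_duplicate compares the counts of the two most frequent categories. The report does not list the raw class counts, so the sketch below uses made-up toy data to show how such a ratio can be computed; the function names are hypothetical. It also shows that a ratio above 1.71 implies the larger class holds more than about 63% of the rows.

```python
from collections import Counter


def top_two_ratio(values):
    """Return (top category, second category, count ratio top / second)."""
    (top, n_top), (second, n_second) = Counter(values).most_common(2)
    return top, second, n_top / n_second


def majority_share(ratio):
    """Share of the larger class in a two-class column with the given ratio."""
    return ratio / (1 + ratio)


# Toy labels chosen to give a 1.71 ratio; not the real dataset counts.
toy_labels = [0] * 171 + [1] * 100
top, second, ratio = top_two_ratio(toy_labels)
print(top, second, round(ratio, 2), round(majority_share(ratio), 3))
```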

Interactions

Correlations

Missing Values